Recovery in Massively Parallel Systems

Authors

Geert Deconinck

J. Vounckx

R. Lauwereins

Abstract

The objective of ESPRIT project 6731 FTMPS [1] is to develop techniques and system software to integrate Fault Tolerance in Massively Parallel Systems [2]. This covers the whole range from error detection, through fault diagnosis and fault isolation, to system and application recovery. The project emphasizes both research into applicability in massively parallel systems and the development of system software that may be commercialized in future products. The project partners are: Parsytec Computer GmbH (D), British Aerospace Ltd. (UK), Katholieke Universiteit Leuven (B), Universität-GH Paderborn (D) (recently replaced by the Medizinische Universität zu Lübeck), Universität Erlangen-Nürnberg (D) and Universidade de Coimbra (P). Although Parsytec systems (the PowerXplorer among them) served as the development hardware, the developed methodologies and implementations have been kept as hardware-independent as possible.

Similar resources

RECOVERY IN MASSIVELY PARALLEL SYSTEMS 1 Recovery in Massively Parallel

Facing up to the Inevitable: Intelligent Error Recovery in Massively Parallel Processing in Memory Architectures

Massively parallel “Processing-In-Memory” (PIM) architectures have been shown to yield increases in performance due to their “memory-centric” nature. However, as PIM is still a developing technology, advanced issues such as error detection and failure recovery have not yet been addressed. We describe the application of concepts found in our multi-agent system, ADE, to PIM, incorporating its mec...

A User-triggered Checkpointing Library for Computation-intensive Applications

We propose a method to incorporate coordinated checkpointing and rollback in high performance computing applications on massively parallel computers. A library allows the user to specify which data-items (including files) belong to the contents of the checkpoint, and to trigger the checkpointing in the application. The recovery-line management on the distributed disk system takes care of which ...

Massively Parallel Execution Model and Massively Parallel Architecture

The purposes of the research and development of the RWC massively parallel computer project are (1) to efficiently support flexible and integrated computation, which are research targets in the RWC Project, (2) to pursue a general-purpose massively parallel system efficiently supporting multiple programming paradigms, and (3) to realize a stand-alone system which has a mature operating system. For th...

A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems

A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...

Publication date: 2007
